The aim of this project is to: (1) build a sale price prediction model using a training set of houses sold from 2006 to 2009 within the Ames Housing data set; (2) use this model to predict sale price for a testing set of houses sold in 2010 within the Ames Housing data set; and (3) apply this model to create a renovation calculator that estimates the value added to a house when certain renovations are made.
To determine the model used in the renovation calculator, we first identify which variables are the best predictors of sale price and then apply that subset of variables to various models. We then compare the performance of the models and choose the best one. That model will help us examine how the sale price changes when the kitchen, bathroom, basement, or roof of a home is renovated.
This project used data from 2930 residential property sales in Ames, Iowa between 2006 and 2010. There are 82 explanatory variables in the data set, comprising nominal, ordinal, discrete, and continuous attributes. Continuous variables describe area dimensions of the house and property, such as the size of the lot and garage. Discrete variables, on the other hand, quantify characteristics of the house and property, like the number of kitchens, baths, bedrooms, and parking spots. Nominal variables generally describe types of materials and locations, such as the name of the neighborhood or the type of foundation. Ordinal variables typically rate the condition and quality of various house characteristics and utilities.
Prior to the exploratory data analysis, we hypothesized that the following variables would be the most predictive of home price: lot area, home type, year built, and overall quality. We chose these because, if we were in the market for a home, they would be among the top criteria we would consider when deciding which home to purchase.
Furthermore, we hypothesized that a generalized additive model (GAM) would be the best model to use, because a GAM can combine the strengths of several other model types, including polynomials, cubic splines, and smoothing splines.
Since our goal is to predict sale price, we first looked at the distribution of sale price in our data set.
What we observe from Figure 1 is that the distribution of sale price is right skewed. There are a few houses in the data set with relatively high prices. This is a limitation that we discuss further in the limitations section of our discussion. We then proceeded to analyze trend data for sale price. More specifically, we explored how sale price varied across the years in which houses were sold.
Figure 2 allows us to see that the relationship between year sold and sale price is linear. Overall, it seems that there is an upward trend in sale price since the 1940s. We can also observe outliers across time. Our next step in the exploratory data analysis was to explore the variables we hypothesized would be strong predictors of price. We began by exploring the variables themselves and then explored the relationship between these variables and sale price.
When it comes to lot area, this dataset has many outliers, as shown above. We found 127 outliers, the smallest of which has a lot area of 17755. As these made visualization difficult, we temporarily removed them. After removing the outliers, we can see that lot area is roughly normally distributed around the median of 9436.5 square feet.
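The report does not state which rule produced the 17755 cutoff. A minimal Python sketch, assuming the common boxplot rule (values above Q3 + 1.5 × IQR count as outliers), might look like the following. The function names and the lot areas are hypothetical illustration values, not rows from the Ames data.

```python
import statistics

def upper_fence(values):
    """Return Q3 + 1.5 * IQR, the upper boxplot fence (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 + 1.5 * (q3 - q1)

def split_outliers(values):
    """Split values into (kept, outliers) using the upper fence."""
    fence = upper_fence(values)
    kept = [v for v in values if v <= fence]
    outliers = [v for v in values if v > fence]
    return kept, outliers

# Hypothetical lot areas in square feet (illustration only).
lot_areas = [7200, 8400, 9100, 9500, 9800, 10400, 11200, 12000, 53000]
kept, outliers = split_outliers(lot_areas)
print(outliers)
```

Removing only the upper tail mirrors the temporary removal described above; the full data would still be used for modeling.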
From Figure 3, we see that 1-story homes built in 1946 or later make up the bulk of our dataset, with 1079 observations. This is over one-third of our total dataset, which has 2930 observations. Please note that the graphs are interactive, so move your cursor over a graph to see more details.
Furthermore, we can also observe from Figure 4 that most homes were built within five years of 2005.
Exploring kitchen quality, from the table below, we can observe that the mean kitchen quality in this data is 3.51.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.511 4.000 5.000
We can observe from Figure 5 that there is a large variation in sale price across different neighborhoods. Even within a neighborhood, we also see variation. Investigating some housing characteristics may give us insight into the variation observed in price within neighborhoods.
We first examined overall quality (Figure 6) and, as expected, price increases as overall quality increases. Examining year built (Figure 7), we observe that the newer a home is, the higher its price, on average.
In addition to investigating the relationship of sale price with location, overall quality, and age of the house, we also examined the relationship between sale price and home type. We find that 2-story homes built in 1946 or later have the highest median home prices (Figure 8).
Figure 9 explores the relationship between kitchen quality and sale price. The higher the kitchen quality, the higher the median sale price. This increase, however, is non-linear and appears quadratic. From Figure 10, we can see that, as expected, there is a gradual positive relationship between lot area and sale price.
Missing data:
We removed observations with missing values from the final data set used for variable selection and modeling.
Modifying variable class:
We decided to keep the selected quality variables continuous as opposed to converting them to factors. We did so because converting to a factor would have led us to drop the “Very Poor” (“1”) factor level, as this level has only around 4 observations. By keeping the variable continuous, we are able to keep these observations and so better predict the prices of homes that fall under this category.
Model Selection:
We began our model selection by reducing the number of variables within our housing data set. We created a subset that included the variables we hypothesized would be important predictors of sale price.
These variables include:
- LotArea: Lot size in square feet
- OverallQual: Rates the overall material and finish of the house
- YearBuilt: Original construction date
- Exterior1st: Exterior covering on house
- HeatingQC: Heating quality and condition
- Foundation: Type of foundation
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- KitchenQual: Kitchen quality
- BsmtFinType1: Rating of basement finished area
- Neighborhood: Physical locations within Ames city limits
- LandSlope: Slope of property
- Street: Type of road access to property
- HouseStyle: Style of dwelling
- GarageQual: Garage quality
- Fence: Fence quality
- YrSold: Year Sold (YYYY)

We further included additional variables that will be utilized later in the report to create a renovation calculator.

- FullBath: Full bathrooms above grade
- RoofStyle: Type of roof

Using our subset, we ran (1) best subset selection, (2) forward stepwise selection, and (3) backward stepwise selection for our variable selection. The graphs below plot the number of variables against the BIC value for our three methods of variable selection.
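Across all variable selection methods, a model with 7 variables has the lowest BIC score. Comparing the variables included in the seven-variable model across the three selection methods, we see that they largely agree: total rooms above grade, overall quality, lot area, kitchen quality, and the Northridge neighborhood indicator appear in all three, and each also includes a level of the basement finish type.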
| x |
|---|
| (Intercept) |
| tot_rms_abv_grd |
| overall_qual |
| lot_area |
| Bsmt.Qual |
| Kitchen.Qual |
| NeighborhoodNorthridge |
| BsmtFin.Type.1Unf |
| x |
|---|
| (Intercept) |
| tot_rms_abv_grd |
| overall_qual |
| lot_area |
| Kitchen.Qual |
| NeighborhoodNorthridge |
| NeighborhoodNorthridge Heights |
| BsmtFin.Type.1Unf |
| x |
|---|
| (Intercept) |
| tot_rms_abv_grd |
| overall_qual |
| lot_area |
| Kitchen.Qual |
| NeighborhoodNorthridge |
| NeighborhoodNorthridge Heights |
| BsmtFin.Type.1GLQ |
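The BIC comparison behind the choice of a 7-variable model can be sketched as follows. This is a minimal Python illustration, assuming the standard Gaussian form of BIC for a least-squares fit, which may differ by a constant offset from the values plotted in R without changing the ranking. The function name and the residual sums of squares are hypothetical; only the sample size of 2392 is taken from the degrees of freedom reported in the model summary.

```python
import math

def gaussian_bic(rss, n_obs, n_params):
    """BIC for a least-squares fit, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + n_params * math.log(n_obs)

# Hypothetical residual sums of squares, keyed by number of predictors.
candidates = {5: 2.60e12, 6: 2.45e12, 7: 2.30e12, 8: 2.299e12}
n_obs = 2392

# Count the intercept as one extra parameter for every model size.
bic_by_size = {k: gaussian_bic(rss, n_obs, k + 1) for k, rss in candidates.items()}
best_size = min(bic_by_size, key=bic_by_size.get)
print(best_size)
```

In this toy example the eighth predictor barely lowers the residual sum of squares, so the log(n) penalty outweighs the gain and the 7-variable model is preferred.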
Following our variable selection analysis, we used those variables to fit a GAM and a linear model to predict sale price.
We began by computing 10-fold CV error estimates for polynomial regression, cubic spline, and smoothing spline models. The graphs below show the results of the cross-validation, allowing us to determine the model and degrees of freedom that best fit the relationship between each of our selected numerical variables and sale price.
A smoothing spline with 2 degrees of freedom appears to be the best model choice for lot area. It has the lowest CV error and the most stable curve.
A smoothing spline with 6 degrees of freedom appears to be the best fit for the total rooms above grade variable. While a cubic spline with fewer degrees of freedom is comparable, the cubic spline becomes more unstable at higher degrees of freedom.
A smoothing spline with 6 degrees of freedom appears to be a good fit for overall quality; however, other models perform comparably well.
A quadratic polynomial appears to be the best fit for kitchen quality, as it has the lowest error.
A cubic spline with 8 degrees of freedom appears to be the best model for year built. Other models are close in CV error and are fairly stable, but the cubic spline model has the lowest error.
For full bathrooms above grade, the plot suggests that the model with the lowest CV error is a smoothing spline with 4 degrees of freedom.
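The cross-validation procedure used to compare candidate fits can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the two candidate models are deliberately simple stand-ins (a constant and a least-squares line), not the polynomial and spline fits from the report, folds are assigned by index rather than at random, and the data are made up.

```python
def fit_mean(xs, ys):
    """Constant model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Simple least-squares line through the training points."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def k_fold_cv_mse(xs, ys, fit, k=10):
    """Mean squared error over k held-out folds (fold = index mod k)."""
    sq_errors = []
    pairs = list(zip(xs, ys))
    for fold in range(k):
        train = [p for i, p in enumerate(pairs) if i % k != fold]
        test = [p for i, p in enumerate(pairs) if i % k == fold]
        predict = fit([x for x, _ in train], [y for _, y in train])
        sq_errors.extend((y - predict(x)) ** 2 for x, y in test)
    return sum(sq_errors) / len(sq_errors)

# Hypothetical data: price rises with x, plus a small alternating wobble.
xs = list(range(1, 41))
ys = [50000 + 3000 * x + (500 if x % 2 else -500) for x in xs]
cv_mean = k_fold_cv_mse(xs, ys, fit_mean)
cv_line = k_fold_cv_mse(xs, ys, fit_line)
print(cv_line < cv_mean)
```

The same loop applies to any candidate: pass a different `fit` function for each model type and degrees of freedom, then pick the candidate with the lowest cross-validated error.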
| model | RMSE | MAE |
|---|---|---|
| linear | 33512.93 | 23495.54 |
| gam | 31157.1 | 21034.14 |
Our hypothesis on model selection was correct. Examining RMSE and MAE for both the linear model and the GAM[2], we can observe that the GAM outperforms the linear model on both metrics.
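For reference, the two metrics in the comparison can be computed as in the minimal Python sketch below. The actual and predicted prices are hypothetical illustration values, not results from either model.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss in dollars."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical sale prices and predictions (illustration only).
actual = [200000, 150000, 320000, 95000]
predicted = [210000, 140000, 300000, 100000]
print(rmse(actual, predicted), mae(actual, predicted))
```

RMSE is always at least as large as MAE, and the gap between them grows when a few predictions miss by a wide margin.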
##
## Call: gam(formula = saleprice ~ s(lot_area, 2) + s(tot_rms_abv_grd,
## 6) + s(overall_qual, 6) + poly(Kitchen.Qual, 2) + bs(year_built,
## 8) + s(full_bath_abv_grd, 4) + Neighborhood + full_bath_abv_grd +
## Roof.Style + BsmtFin.Type.1, data = training)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -319638 -14814 -908 13116 211142
##
## (Dispersion Parameter for gaussian family taken to be 987751789)
##
## Null Deviance: 15347390784124 on 2391 degrees of freedom
## Residual Deviance: 2296522858852 on 2325 degrees of freedom
## AIC: 56396.82
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value
## s(lot_area, 2) 1 875284022212 875284022212 886.1376
## s(tot_rms_abv_grd, 6) 1 3113363822414 3113363822414 3151.9698
## s(overall_qual, 6) 1 6394247223301 6394247223301 6473.5365
## poly(Kitchen.Qual, 2) 2 324693372011 162346686005 164.3598
## bs(year_built, 8) 8 301174084064 37646760508 38.1136
## s(full_bath_abv_grd, 4) 1 26183048366 26183048366 26.5077
## Neighborhood 27 464638877046 17208847298 17.4222
## Roof.Style 5 18799865768 3759973154 3.8066
## BsmtFin.Type.1 6 137947181826 22991196971 23.2763
## Residuals 2325 2296522858852 987751789
## Pr(>F)
## s(lot_area, 2) < 0.00000000000000022 ***
## s(tot_rms_abv_grd, 6) < 0.00000000000000022 ***
## s(overall_qual, 6) < 0.00000000000000022 ***
## poly(Kitchen.Qual, 2) < 0.00000000000000022 ***
## bs(year_built, 8) < 0.00000000000000022 ***
## s(full_bath_abv_grd, 4) 0.0000002845 ***
## Neighborhood < 0.00000000000000022 ***
## Roof.Style 0.001949 **
## BsmtFin.Type.1 < 0.00000000000000022 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(lot_area, 2) 1 81.966 < 0.00000000000000022 ***
## s(tot_rms_abv_grd, 6) 5 23.867 < 0.00000000000000022 ***
## s(overall_qual, 6) 5 50.247 < 0.00000000000000022 ***
## poly(Kitchen.Qual, 2)
## bs(year_built, 8)
## s(full_bath_abv_grd, 4) 3 28.924 < 0.00000000000000022 ***
## Neighborhood
## full_bath_abv_grd
## Roof.Style
## BsmtFin.Type.1
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the summary output for the GAM, all of our variables are statistically significant at the p = 0.01 level, which suggests that these variables are relevant predictors of saleprice. This is in line with part of our hypothesis: lot_area, overall_qual, and year_built are statistically significant predictors of saleprice. Contrary to our hypothesis, home type was not among the selected predictors and does not appear in the final model.
Though the gam model had both a lower RMSE and MAE than the linear model and also better accommodates the flexibility in predictors, it is substantially more difficult to interpret the impact of each predictor on sale price. However, we do know based on the ANOVA parametric and ANOVA nonparametric output in the summary which variables are statistically significant in the model.
Below is a plot of the smooth variables.
| Neighborhood | MAPE |
|---|---|
| Northpark Villa | 0.0273512 |
| College Creek | 0.0736748 |
| Briardale | 0.0772398 |
| Sawyer | 0.0775484 |
| Mitchell | 0.0913307 |
| South & West Iowa State University | 0.0944263 |
| Brookside | 0.0958083 |
| Northwest Ames | 0.1025699 |
| NAmes | 0.1055915 |
| Somerset | 0.1066034 |
| Meadow Village | 0.1130381 |
| Gilbert | 0.1139046 |
| Sawyer West | 0.1220453 |
| Crawford | 0.1383723 |
| Greens | 0.1409893 |
| Timberland | 0.1467829 |
| Northridge Heights | 0.1507588 |
| Clear Creek | 0.1508536 |
| Northridge | 0.1524669 |
| Stone Brook | 0.1721457 |
| Bluestem | 0.1757224 |
| Edwards | 0.1830974 |
| Bloomington Heights | 0.1944211 |
| Old Town | 0.2612928 |
| Iowa DOT and Rail Road | 0.2640252 |
Based on the Mean Absolute Percentage Error (MAPE), we found that our model is better at predicting sale price for some neighborhoods, like Northpark Villa (0.027) and South & West Iowa State University (0.094), but worse for others, like Iowa DOT and Rail Road (0.264) and Bloomington Heights (0.194).
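A per-neighborhood MAPE of this kind can be computed as in the minimal Python sketch below. The function name and the rows are hypothetical; the values are expressed as fractions, matching the table, rather than as percentages.

```python
from collections import defaultdict

def mape_by_group(rows):
    """rows: iterable of (group, actual, predicted). Returns {group: MAPE}."""
    errors = defaultdict(list)
    for group, actual, predicted in rows:
        errors[group].append(abs(actual - predicted) / actual)
    return {group: sum(v) / len(v) for group, v in errors.items()}

# Hypothetical (neighborhood, sale price, predicted price) rows.
rows = [
    ("Sawyer", 100000, 110000),
    ("Sawyer", 200000, 190000),
    ("Old Town", 80000, 100000),
]
result = mape_by_group(rows)
print(result)
```

Because each error is divided by the actual price, neighborhoods with low-priced homes can show a high MAPE even when the dollar error is modest.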
Our goal was to create a baseline representing the worst version of the most common house. We approximated it by taking the lowest number of full baths, the lowest kitchen quality, an unfinished basement, and the mode of all the other variables in our dataset.
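One way to assemble such a baseline row is sketched below in Python. This is a minimal illustration, assuming the data are held as a list of dictionaries; the function name, field names, and rows are hypothetical and do not match the column names in the report's data set.

```python
import statistics

def build_baseline(rows, worst_case_fields, overrides=None):
    """Build one baseline row: minimum for worst-case fields, mode elsewhere."""
    baseline = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        if field in worst_case_fields:
            baseline[field] = min(values)
        else:
            baseline[field] = statistics.mode(values)
    # Explicit overrides, e.g. forcing an unfinished basement.
    baseline.update(overrides or {})
    return baseline

# Hypothetical houses (illustration only).
rows = [
    {"full_bath": 2, "kitchen_qual": 3, "bsmt_fin_type": "GLQ", "roof_style": "Gable"},
    {"full_bath": 1, "kitchen_qual": 4, "bsmt_fin_type": "Unf", "roof_style": "Gable"},
    {"full_bath": 0, "kitchen_qual": 1, "bsmt_fin_type": "GLQ", "roof_style": "Hip"},
]
baseline = build_baseline(
    rows, {"full_bath", "kitchen_qual"}, overrides={"bsmt_fin_type": "Unf"}
)
print(baseline)
```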
To determine the cost of improvements in full baths above grade, kitchen, roof, and basement we created a base sale price for comparison. This sale price for our base comparison consists of the following characteristics:
Based on our renovation calculator:
If you change your roof type from the most common roof type to any other roof type, on average, your house value will go down by $11,124.38.
If you go from an unfinished basement to any type of finished basement, our model predicts that, on average, your house value will go down by $1,084.64.
If you make any upgrade to the kitchen from the lowest kitchen quality, on average, your house value will go up by $34,580.24.
If you add any number of full bathrooms above grade (when you started with zero), on average, your house value will go up by $49,657.56.
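The idea behind these estimates can be sketched as a what-if comparison: predict the price of the baseline house, change one characteristic, predict again, and take the difference. The Python sketch below is a minimal illustration under stated assumptions. `toy_predict` is a hypothetical stand-in for a fitted model's prediction function, and averaging the difference over all alternative levels is an assumed reading of "on average", not a method confirmed by the report.

```python
def renovation_delta(predict, house, changes):
    """Predicted price after applying `changes` minus predicted price before."""
    renovated = {**house, **changes}
    return predict(renovated) - predict(house)

def average_delta(predict, house, field, options):
    """Average price change across several alternative values of one field."""
    deltas = [renovation_delta(predict, house, {field: o}) for o in options]
    return sum(deltas) / len(deltas)

# Hypothetical stand-in for a fitted model (not the GAM from the report).
def toy_predict(house):
    return 50000 + 20000 * house["kitchen_qual"] + 15000 * house["full_bath"]

baseline = {"kitchen_qual": 1, "full_bath": 0}
one_upgrade = renovation_delta(toy_predict, baseline, {"kitchen_qual": 3})
any_upgrade = average_delta(toy_predict, baseline, "kitchen_qual", [2, 3, 4, 5])
print(one_upgrade, any_upgrade)
```

Holding every other characteristic at its baseline value isolates the effect of the single renovation, which is why the choice of baseline matters for the resulting estimates.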
We then used the estimated value changes from our renovation calculator to predict a new sale price for houses sold in 2010. We created four subset data frames: houses with the lowest kitchen quality, an unfinished basement, a gable roof style, and zero full bathrooms.
The predicted sales prices are displayed below.
lq.kitchen.df
## saleprice New_Saleprice
## 1 107500 142080.2
lq.bathroom.df
## saleprice New_Saleprice
## 1 144000 193657.6
## 2 260000 309657.6
head(lq.basement.df)
## saleprice New_Saleprice
## 1 189000 187915.36
## 2 175900 174815.36
## 3 180400 179315.36
## 4 88000 86915.36
## 5 120000 118915.36
## 6 376162 375077.36
head(lq.roof.df)
## saleprice New_Saleprice
## 1 105000 93875.62
## 2 189900 178775.62
## 3 195500 184375.62
## 4 213500 202375.62
## 5 191500 180375.62
## 6 236500 225375.62
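The new sale prices above are the original sale prices shifted by the calculator's estimate for the relevant renovation. A minimal Python sketch of that arithmetic, using the first two basement rows and the rounded basement estimate of -1084.64, is shown below. The function name is hypothetical.

```python
def apply_uplift(sale_prices, uplift):
    """Add a fixed renovation estimate (positive or negative) to each price."""
    return [round(price + uplift, 2) for price in sale_prices]

# Finishing a basement: estimated change of about -1084.64 dollars.
basement_uplift = -1084.64
new_prices = apply_uplift([189000, 175900], basement_uplift)
print(new_prices)
```

Because the same amount is applied to every house in a subset, the adjustment does not depend on any other characteristic of the house.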
Particular Home Improvement Recommendations
If you have the lowest quality kitchen, you should renovate it and upgrade to any better quality kitchen.
If you have zero full bathrooms above grade, you should renovate and add at least one, for an average increase of about $49,657.56.
If you have a gable roof, you should not renovate to change the roof type.
Lasso: We were unable to calculate the CV error associated with the lasso variable selection method. As our prior experience with lasso was limited to primarily quantitative variables, we got stuck on errors when trying to calculate the error with the qualitative variables selected by lasso. While we are confident that this is possible, we were unfamiliar with the precise syntax. However, several of the variables that lasso recommended, such as lot area and total rooms above grade, were also recommended by our other variable selection methods.
GAM: We had trouble figuring out how to use the GAM to predict the average price change caused by a specific renovation. For example, if a client asked us how much the price of a home would change on average if an additional bathroom were added, we were unsure how exactly to estimate this (though we understood how to do this with a normal linear regression). On the other hand, we felt that the GAM could be put to better use predicting renovation price changes when a client gave us all the specifications of a particular house and then wanted to change one part of it (we would simply put all the information into a single row and run the predict.Gam function on it).
Renovation Calculator: Our renovation calculator does seem to at least pick up on the fact that improving the kitchen or going from 0 to any number of full bathrooms above grade will add to your sale price. However, our calculator predicts that finishing an unfinished basement in any capacity will decrease the sale price on average, which is unusual. This might suggest that the GAM doesn’t predict changes in basement type very well.
Chosen Subset of Variables: When attempting to determine the variables that were the best predictors of sale price with the forward, backward, and best subset methods, using nvmax > 7 would cause R to restart, so we could only choose the best subset of up to 7 variables.
Sale Price: It is important to note that sale price is right skewed, which affects the performance of our model on data outside of the Ames Housing data set. We did not apply any transformation to address the skew.